-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(html): add normalize
function for HTML entities (#4523)
#4524
Conversation
function escapeXmlRestricted(str: string) { | ||
return str.replaceAll( | ||
// deno-lint-ignore no-control-regex | ||
/[^\x09\x0a\x0d\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufdcf\ufdf0-\ufffd\u{10000}-\u{1fffd}\u{20000}-\u{2fffd}\u{30000}-\u{3fffd}\u{40000}-\u{4fffd}\u{50000}-\u{5fffd}\u{60000}-\u{6fffd}\u{70000}-\u{7fffd}\u{80000}-\u{8fffd}\u{90000}-\u{9fffd}\u{a0000}-\u{afffd}\u{b0000}-\u{bfffd}\u{c0000}-\u{cfffd}\u{d0000}-\u{dfffd}\u{e0000}-\u{efffd}\u{f0000}-\u{ffffd}\u{100000}-\u{10fffd}]+/gu, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we start escaping these chars by default?
Also does HTML have the same restricted chars concept as XML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon reflection, this PR needs some more thought. I'd initially thought a sort of "baseline-compatible with both HTML and XML" would be a sensible default, given that these characters are rare in practice yet could cause problems in XML. But it turns out that entities for certain C1 control-character codepoints have different semantics in HTML than XML — for example, €
:
['application/xml', 'text/html'].map((contentType) => {
const { literal, entity } = JSON.parse(
new DOMParser().parseFromString(
'<div>{"literal": "\x80", "entity": "€"}</div>',
contentType,
).querySelector('div').textContent,
)
return { contentType, literal, entity }
})
// { "contentType": "application/xml", "literal": "\x80", "entity": "\x80" }
// { "contentType": "text/html", "literal": "\x80", "entity": "€" }
Also, I'm not sure converting those characters to entities is the right approach. Probably simply stripping them out would be more sensible in most cases.
@@ -34,15 +38,23 @@ const rawRe = new RegExp(`[${[...rawToEntity.keys()].join("")}]`, "g"); | |||
* // Characters that don't need to be escaped will be left alone, | |||
* // even if named HTML entities exist for them. | |||
* escape("þð"); // "þð" | |||
* // You can force non-ASCII chars to be escaped by setting the `form` option to `compatibility`: | |||
* escape("þð", { form: "compatibility" }); // "þð" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
form: "compatibility"
sounds confusing to me as I don't see what it's compatible with.
* | ||
* // specifying a `form` option (default is `readability`): | ||
* normalize("两只小蜜蜂", { form: "readability" }); // "两只小蜜蜂" | ||
* normalize("两只小蜜蜂", { form: "compatibility" }); // "两只小蜜蜂" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
form: "compatibility"
feels defeating the purpose of normalize
to me. How about not having this option for now?
@lionel-rowe, to not keep stale PRs open, are you happy for us to close this PR for now? You can re-open the existing PR or a new one once it is ready. |
Sure, I'll close now. |
Fixes #4523
Various bikeshedding things:
html
? If so, to what?xml
suffers the same issue but in reversehtml_and_xml
is pretty grossmarkup
seems overly vaguesgml_like
might be technically correct, but would the average web developer think to look there?escapeAllCharsAsHex
? Could be useful for users wanting finer-grained control than the 2 normalization forms, but it's a very simple function to replicateNormalizationForm
OK or is it too confusing with Unicode normalization forms (NFC, NFD, etc?)